We will analyze the “IBM HR Analytics Employee Attrition & Performance” dataset ((link)[https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset]). The dataset documentation says that it was synthesized by data scientists at IBM. Observations describe hypothetical employees. Rows measure the employee retention status, metrics about the workplace, and details about the employee.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(leaps)) # model selection tools
suppressPackageStartupMessages(library(faraway)) # VIF function
suppressPackageStartupMessages(library(MASS)) # Box-Cox
source('./lib/fmrhs.R') # requires tidyverse
source('./lib/Pretty correlation.R')
We load the data using a custom column mapping. A few considerations apply:
# ordered factor
ordered_factor = col_factor(levels = 1:5, ordered = F)
# column mappings
types = cols(
# continuous variables
Age = col_double(), DailyRate = col_double(),
DistanceFromHome = col_double(), HourlyRate = col_double(),
#JobLevel = col_double(),
JobLevel = col_factor(),
# ordered factors we later recode
# e.g., JobSatisfaction ranges from 1 - 4 (= 'Low' to 'Very High')
Education = col_factor(levels = 1:5, ordered = F),
EnvironmentSatisfaction = col_factor(1:4, ordered = F),
JobInvolvement = col_factor(1:4, ordered = F),
JobSatisfaction = col_factor(levels = 1:4, ordered = F),
PerformanceRating = col_factor(levels = 1:4, ordered = F),
RelationshipSatisfaction = col_factor(levels = 1:4, ordered = F),
WorkLifeBalance = col_factor(levels = 1:4, ordered = F),
# other factors
BusinessTravel = col_factor(levels = c('Non-Travel', 'Travel_Rarely', 'Travel_Frequently'),
ordered = F),
Gender = col_factor(), JobRole = col_factor(),
MaritalStatus = col_factor(), EducationField = col_factor(),
# things we have decided to treat as factors
Department = col_factor(), # we presume we could map to a taxonomy
StockOptionLevel = col_factor(), # unclear scale
# true/false
Attrition = col_factor(),
OverTime = col_factor(),
# drop
EmployeeCount = col_skip(), # always 1
StandardHours = col_skip(), # always 80
EmployeeNumber = col_skip(), # 1, 2, ...
Over18 = col_skip(), # always true
# continous values
MonthlyIncome = col_double(), MonthlyRate = col_double(),
NumCompaniesWorked = col_double(), PercentSalaryHike = col_double(),
TotalWorkingYears = col_double(), TrainingTimesLastYear = col_double(),
WorkLifeBalance = col_double(), YearsAtCompany = col_double(),
YearsInCurrentRole = col_double(), YearsSinceLastPromotion = col_double(),
YearsWithCurrManager = col_double()
)
# read in using mapping
ibm = read_csv("data/ibm.csv", col_types = types)
# rename levels
ibm$Education = recode(ibm$Education,
'1' = 'Below College', '2' = 'College', '3' = 'Bachelor', '4' = 'Master', '5' = 'Doctor')
ibm$EnvironmentSatisfaction = recode(ibm$EnvironmentSatisfaction,
'1' = 'Low', '2' = 'Medium', '3' = 'High', '4' = 'Very High')
ibm$JobInvolvement = recode(ibm$JobInvolvement,
'1' = 'Low', '2' = 'Medium', '3' = 'High', '4' = 'Very High')
ibm$JobSatisfaction = recode(ibm$JobSatisfaction,
'1' = 'Low', '2' = 'Medium', '3' = 'High', '4' = 'Very High')
ibm$PerformanceRating = recode(ibm$PerformanceRating,
'1' = 'Low', '2' = 'Good', '3' = 'Excellent', '4' = 'Outstanding')
ibm$RelationshipSatisfaction = recode(ibm$RelationshipSatisfaction,
'1' = 'Low', '2' = 'Medium', '3' = 'High', '4' = 'Very High')
ibm$WorkLifeBalance = recode(ibm$WorkLifeBalance,
'1' = 'Bad', '2' = 'Good', '3' = 'Better', '4' = 'Best')
# set contrasts for yes/no levels
ibm$Attrition = relevel(ibm$Attrition, ref = "No")
ibm$OverTime = relevel(ibm$OverTime, ref = "No")
We produce a defactored version of the data for which we have replaced factors with numeric vectors.
ibm.defactored = mutate_if(ibm, is.factor, ~ as.numeric(.x))
We show some sample data:
# a sample observation
t(head(ibm, n = 1))
[,1]
Age "41"
Attrition "Yes"
BusinessTravel "Travel_Rarely"
DailyRate "1102"
Department "Sales"
DistanceFromHome "1"
Education "College"
EducationField "Life Sciences"
EnvironmentSatisfaction "Medium"
Gender "Female"
HourlyRate "94"
JobInvolvement "High"
JobLevel "2"
JobRole "Sales Executive"
JobSatisfaction "Very High"
MaritalStatus "Single"
MonthlyIncome "5993"
MonthlyRate "19479"
NumCompaniesWorked "8"
OverTime "Yes"
PercentSalaryHike "11"
PerformanceRating "Excellent"
RelationshipSatisfaction "Low"
StockOptionLevel "0"
TotalWorkingYears "8"
TrainingTimesLastYear "0"
WorkLifeBalance "Bad"
YearsAtCompany "6"
YearsInCurrentRole "4"
YearsSinceLastPromotion "0"
YearsWithCurrManager "5"
head(ibm, n = 10)[,c(1:5)]
There are 1,470 rows and 31 columns.
par(mfrow=c(2,2))
plot(ibm$EducationField)
plot(ibm$JobInvolvement)
plot(ibm$Education)
with(ibm, hist(Age))
summary(ibm)
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education
Min. :18.00 No :1233 Non-Travel : 150 Min. : 102.0 Sales :446 Min. : 1.000 Below College:170
1st Qu.:30.00 Yes: 237 Travel_Rarely :1043 1st Qu.: 465.0 Research & Development:961 1st Qu.: 2.000 College :282
Median :36.00 Travel_Frequently: 277 Median : 802.0 Human Resources : 63 Median : 7.000 Bachelor :572
Mean :36.92 Mean : 802.5 Mean : 9.193 Master :398
3rd Qu.:43.00 3rd Qu.:1157.0 3rd Qu.:14.000 Doctor : 48
Max. :60.00 Max. :1499.0 Max. :29.000
EducationField EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
Life Sciences :606 Low :284 Female:588 Min. : 30.00 Low : 83 2:534
Other : 82 Medium :287 Male :882 1st Qu.: 48.00 Medium :375 1:543
Medical :464 High :453 Median : 66.00 High :868 3:218
Marketing :159 Very High:446 Mean : 65.89 Very High:144 4:106
Technical Degree:132 3rd Qu.: 83.75 5: 69
Human Resources : 27 Max. :100.00
JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked OverTime
Sales Executive :326 Low :289 Single :470 Min. : 1009 Min. : 2094 Min. :0.000 No :1054
Research Scientist :292 Medium :280 Married :673 1st Qu.: 2911 1st Qu.: 8047 1st Qu.:1.000 Yes: 416
Laboratory Technician :259 High :442 Divorced:327 Median : 4919 Median :14236 Median :2.000
Manufacturing Director :145 Very High:459 Mean : 6503 Mean :14313 Mean :2.693
Healthcare Representative:131 3rd Qu.: 8379 3rd Qu.:20462 3rd Qu.:4.000
Manager :102 Max. :19999 Max. :26999 Max. :9.000
(Other) :215
PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel TotalWorkingYears TrainingTimesLastYear
Min. :11.00 Low : 0 Low :276 0:631 Min. : 0.00 Min. :0.000
1st Qu.:12.00 Good : 0 Medium :303 1:596 1st Qu.: 6.00 1st Qu.:2.000
Median :14.00 Excellent :1244 High :459 3: 85 Median :10.00 Median :3.000
Mean :15.21 Outstanding: 226 Very High:432 2:158 Mean :11.28 Mean :2.799
3rd Qu.:18.00 3rd Qu.:15.00 3rd Qu.:3.000
Max. :25.00 Max. :40.00 Max. :6.000
WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
Bad : 80 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
Good :344 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.000
Better:893 Median : 5.000 Median : 3.000 Median : 1.000 Median : 3.000
Best :153 Mean : 7.008 Mean : 4.229 Mean : 2.188 Mean : 4.123
3rd Qu.: 9.000 3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.000
Max. :40.000 Max. :18.000 Max. :15.000 Max. :17.000
We look at columns for which the correlation is greater than 70 in the defactored data:
p = round(cor(ibm.defactored) * 100)
for(i in 1:nrow(p)) {
for(j in i:ncol(p)) {
if(i == j) next
val = p[i,j]
if(abs(val) > 70) {
r = row.names(p)[i]
c = colnames(p)[j]
str = paste("cor(", r, ", ", c, ") = ", val, "\n")
cat(str)
}
}
}
cor( JobLevel , MonthlyIncome ) = 76
cor( MonthlyIncome , TotalWorkingYears ) = 77
cor( PercentSalaryHike , PerformanceRating ) = 77
cor( YearsAtCompany , YearsInCurrentRole ) = 76
cor( YearsAtCompany , YearsWithCurrManager ) = 77
cor( YearsInCurrentRole , YearsWithCurrManager ) = 71
We also examine the VIFs of the defactored data for general interest. We perform a formal test when we investigate specific models.
model.defactored = lm(MonthlyIncome ~ ., data = ibm.defactored)
model.defactored.vifs = faraway::vif(model.defactored)
# Five highest VIFs with this "defactored" model on Monthly income
#
model.defactored.vifs[order(model.defactored.vifs, decreasing = T)][1:5]
YearsAtCompany TotalWorkingYears YearsWithCurrManager YearsInCurrentRole PercentSalaryHike
4.626309 3.922090 2.831112 2.754956 2.532532
Our first question is
What factors correlate with attrition?
We choose to investigate this using the leaps framework to evaluate possible variables to include. Given the fairly small data set, we perform an exhaustive search using regsubsets from the leaps package.
# this took about six minutes on one computer tested
system.time({
allreg = regsubsets(MonthlyIncome ~ ., ibm, nbest = 1, really.big = T)
})
2 linear dependencies found
Reordering variables and trying again:
user system elapsed
6.040 0.024 6.087
We now print out the top results for several criteria.
pp_allreg(allreg)
Maximized: r2 with value 0.947986365994346
Formula: (response) ~ 1 + JobLevel1 + JobLevel3 + JobLevel4 + JobLevel5 + JobRoleResearch Scientist + JobRoleLaboratory Technician + JobRoleManager + JobRoleSales Representative + JobRoleResearch Director
Maximized: adjr2 with value 0.9476657340039
Formula: (response) ~ 1 + JobLevel1 + JobLevel3 + JobLevel4 + JobLevel5 + JobRoleResearch Scientist + JobRoleLaboratory Technician + JobRoleManager + JobRoleSales Representative + JobRoleResearch Director
Minimized: mse with value 1159981.52574639
Formula: (response) ~ 1 + JobLevel1 + JobLevel3 + JobLevel4 + JobLevel5 + JobRoleResearch Scientist + JobRoleLaboratory Technician + JobRoleManager + JobRoleSales Representative + JobRoleResearch Director
Minimized: cp with value 66.1106375795891
Formula: (response) ~ 1 + JobLevel1 + JobLevel3 + JobLevel4 + JobLevel5 + JobRoleResearch Scientist + JobRoleLaboratory Technician + JobRoleManager + JobRoleSales Representative + JobRoleResearch Director
Minimized: bic with value -4272.75644465781
Formula: (response) ~ 1 + JobLevel1 + JobLevel3 + JobLevel4 + JobLevel5 + JobRoleResearch Scientist + JobRoleLaboratory Technician + JobRoleManager + JobRoleSales Representative + JobRoleResearch Director
Since the model with the extreme value for every criterion is the same in the exhautive search, we fit the specified model:
model.allreg = lm(MonthlyIncome ~ 1 + JobLevel + JobRole, data = ibm)
CORRECT ME.
The equation would be \[\begin{array}{rcl} \text{Monthly_Income} &=& \beta_0 + \beta_1 \times \text{BusinessTravel} + \beta_2 \times \text{EnvironmentSatisfaction} \\ && + \beta_3 \times \text{JobInvolvement} + \beta_4 \times \text{JobRole} + \beta_5 \times \text{JobSatisfaction} \\ && + \beta_6 \times \text{OverTime} + \beta_7 \times \text{StockOptionLevel} + \beta_8 \times \text{TotalWorkingYears} \end{array}\]
summary(model.allreg)
Call:
lm(formula = MonthlyIncome ~ 1 + JobLevel + JobRole, data = ibm)
Residuals:
Min 1Q Median 3Q Max
-2912.3 -662.0 -87.3 651.8 4253.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5762.72 64.63 89.161 < 2e-16 ***
JobLevel1 -1878.46 102.01 -18.415 < 2e-16 ***
JobLevel3 3474.85 92.98 37.370 < 2e-16 ***
JobLevel4 7439.74 154.88 48.037 < 2e-16 ***
JobLevel5 10089.59 202.41 49.848 < 2e-16 ***
JobRoleResearch Scientist -1029.30 120.58 -8.536 < 2e-16 ***
JobRoleLaboratory Technician -1115.24 120.58 -9.249 < 2e-16 ***
JobRoleManufacturing Director -59.07 107.22 -0.551 0.582
JobRoleHealthcare Representative 87.79 111.27 0.789 0.430
JobRoleManager 3328.58 177.16 18.789 < 2e-16 ***
JobRoleSales Representative -1416.68 162.62 -8.712 < 2e-16 ***
JobRoleResearch Director 3357.60 167.92 19.995 < 2e-16 ***
JobRoleHuman Resources -735.81 172.87 -4.256 2.21e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1071 on 1457 degrees of freedom
Multiple R-squared: 0.9487, Adjusted R-squared: 0.9483
F-statistic: 2246 on 12 and 1457 DF, p-value: < 2.2e-16
model.simple = lm(MonthlyIncome ~ JobLevel, data = ibm)
summary(model.simple)
Call:
lm(formula = MonthlyIncome ~ JobLevel, data = ibm)
Residuals:
Min 1Q Median 3Q Max
-4607.3 -713.2 -93.9 694.1 4495.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5502.28 55.80 98.61 <2e-16 ***
JobLevel1 -2715.36 78.58 -34.56 <2e-16 ***
JobLevel3 4314.98 103.63 41.64 <2e-16 ***
JobLevel4 10001.51 137.10 72.95 <2e-16 ***
JobLevel5 13689.55 164.94 83.00 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1289 on 1465 degrees of freedom
Multiple R-squared: 0.9252, Adjusted R-squared: 0.925
F-statistic: 4530 on 4 and 1465 DF, p-value: < 2.2e-16
We first check for multicollinearity:
faraway::vif(model.allreg)
JobLevel1 JobLevel3 JobLevel4 JobLevel5
3.108806 1.400625 2.058392 2.350623
JobRoleResearch Scientist JobRoleLaboratory Technician JobRoleManufacturing Director JobRoleHealthcare Representative
2.968278 2.706742 1.310810 1.289090
JobRoleManager JobRoleSales Representative JobRoleResearch Director JobRoleHuman Resources
2.599189 1.806889 1.861053 1.307837
plot(model.allreg, which = 1)
boxcox(model.allreg, lambda = seq(0.5, 1, by= 0.1))
model.allreg.transform = lm(MonthlyIncome^0.7 ~ JobLevel + JobRole, data = ibm)
summary(model.allreg.transform)
Call:
lm(formula = MonthlyIncome^0.7 ~ JobLevel + JobRole, data = ibm)
Residuals:
Min 1Q Median 3Q Max
-159.257 -35.323 -3.147 34.838 205.714
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 426.643 3.387 125.952 < 2e-16 ***
JobLevel1 -111.099 5.346 -20.781 < 2e-16 ***
JobLevel3 170.382 4.873 34.963 < 2e-16 ***
JobLevel4 334.287 8.117 41.184 < 2e-16 ***
JobLevel5 430.551 10.608 40.587 < 2e-16 ***
JobRoleResearch Scientist -55.250 6.319 -8.743 < 2e-16 ***
JobRoleLaboratory Technician -59.876 6.320 -9.475 < 2e-16 ***
JobRoleManufacturing Director -3.300 5.619 -0.587 0.557
JobRoleHealthcare Representative 4.492 5.832 0.770 0.441
JobRoleManager 137.502 9.285 14.810 < 2e-16 ***
JobRoleSales Representative -79.963 8.523 -9.382 < 2e-16 ***
JobRoleResearch Director 140.291 8.801 15.941 < 2e-16 ***
JobRoleHuman Resources -40.182 9.060 -4.435 9.89e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 56.11 on 1457 degrees of freedom
Multiple R-squared: 0.936, Adjusted R-squared: 0.9354
F-statistic: 1775 on 12 and 1457 DF, p-value: < 2.2e-16
plot(model.allreg.transform, which = 1)
qqline(model.allreg.transform$residuals, col="red")
boxcox(model.allreg.transform)
plot(model.simple, which = 1)
boxcox(model.simple, lambda = seq(0.2, .6, by= 0.1))
#simple.transform = ibm
#simple.transform$MonthlyIncome = simple.transform$MonthlyIncome^0.5
model.simple.transform = lm(sqrt(MonthlyIncome) ~ JobLevel, data = ibm)
plot(model.simple.transform, which = 1)
{qqnorm(model.simple.transform$residuals, col="red")
qqline(model.simple.transform$residuals, col="red")}
boxcox(model.simple.transform)
summary(model.allreg.transform)
Call:
lm(formula = MonthlyIncome^0.7 ~ JobLevel + JobRole, data = ibm)
Residuals:
Min 1Q Median 3Q Max
-159.257 -35.323 -3.147 34.838 205.714
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 426.643 3.387 125.952 < 2e-16 ***
JobLevel1 -111.099 5.346 -20.781 < 2e-16 ***
JobLevel3 170.382 4.873 34.963 < 2e-16 ***
JobLevel4 334.287 8.117 41.184 < 2e-16 ***
JobLevel5 430.551 10.608 40.587 < 2e-16 ***
JobRoleResearch Scientist -55.250 6.319 -8.743 < 2e-16 ***
JobRoleLaboratory Technician -59.876 6.320 -9.475 < 2e-16 ***
JobRoleManufacturing Director -3.300 5.619 -0.587 0.557
JobRoleHealthcare Representative 4.492 5.832 0.770 0.441
JobRoleManager 137.502 9.285 14.810 < 2e-16 ***
JobRoleSales Representative -79.963 8.523 -9.382 < 2e-16 ***
JobRoleResearch Director 140.291 8.801 15.941 < 2e-16 ***
JobRoleHuman Resources -40.182 9.060 -4.435 9.89e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 56.11 on 1457 degrees of freedom
Multiple R-squared: 0.936, Adjusted R-squared: 0.9354
F-statistic: 1775 on 12 and 1457 DF, p-value: < 2.2e-16
summary(model.simple.transform)
Call:
lm(formula = sqrt(MonthlyIncome) ~ JobLevel, data = ibm)
Residuals:
Min 1Q Median 3Q Max
-28.3972 -5.2171 -0.2344 5.0746 26.4043
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 73.5857 0.3536 208.13 <2e-16 ***
JobLevel1 -21.2677 0.4979 -42.71 <2e-16 ***
JobLevel3 25.0771 0.6567 38.19 <2e-16 ***
JobLevel4 50.7090 0.8687 58.37 <2e-16 ***
JobLevel5 64.9366 1.0452 62.13 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.17 on 1465 degrees of freedom
Multiple R-squared: 0.9055, Adjusted R-squared: 0.9052
F-statistic: 3508 on 4 and 1465 DF, p-value: < 2.2e-16
\[ E(\varepsilon) = 0, \quad \sigma^2 \text{ constant}\]
regnull = lm(MonthlyIncome ~ 1, data = ibm)
regfull = lm(MonthlyIncome ~ ., data = ibm)
#step(regfull, scope=list(lower=regnull, upper=regfull), direction="backward")
#step(regfull, scope=list(lower=regnull, upper=regfull), direction="forward")
step(regfull, scope=list(lower=regnull, upper=regfull), direction="both", trace = 0)
Call:
lm(formula = MonthlyIncome ~ Gender + JobInvolvement + JobLevel +
JobRole + NumCompaniesWorked + StockOptionLevel + TotalWorkingYears +
YearsInCurrentRole, data = ibm)
Coefficients:
(Intercept) GenderMale JobInvolvementMedium JobInvolvementHigh
5515.998 80.608 -199.391 -316.879
JobInvolvementVery High JobLevel1 JobLevel3 JobLevel4
-291.419 -1594.622 3254.474 6838.395
JobLevel5 JobRoleResearch Scientist JobRoleLaboratory Technician JobRoleManufacturing Director
9405.617 -1186.292 -1286.324 -87.914
JobRoleHealthcare Representative JobRoleManager JobRoleSales Representative JobRoleResearch Director
7.335 3349.003 -1474.261 3389.699
JobRoleHuman Resources NumCompaniesWorked StockOptionLevel1 StockOptionLevel3
-869.088 29.323 112.508 -124.918
StockOptionLevel2 TotalWorkingYears YearsInCurrentRole
-49.980 33.183 13.723
model.best = lm(MonthlyIncome ~ Gender + JobInvolvement + JobLevel + JobRole +
NumCompaniesWorked + StockOptionLevel + TotalWorkingYears +
YearsInCurrentRole, data = ibm)
plot(model.best, which = 1)
boxcox(model.best, lambda = seq(0.3, 1.2, 0.1))
#best.transform = ibm
#best.transform$MonthlyIncome = best.transform$MonthlyIncome^0.7
model.best.transform = lm(MonthlyIncome^0.7 ~ Gender + JobInvolvement + JobLevel + JobRole +
NumCompaniesWorked + StockOptionLevel + TotalWorkingYears +
YearsInCurrentRole, data = ibm )
plot(model.best.transform , which = 1)
boxcox(model.best.transform, lambda = seq(0.3, 1.2, 0.1))
transformDoubles = function(df, func) {
dbls = sapply(df, is.double)
copy = df
copy[,dbls] = func(df[,dbls])
copy
}
allregMonthlyIncome = ibm$MonthlyIncome^.7
allreg.transform.predictors = ibm[-c(17)]
## We tried 3 types of transformation to all of non factor predictors (ln, )
allreg.transform.factor = allreg.transform.predictors[,sapply(allreg.transform.predictors, is.factor)]
allreg.transform.double = allreg.transform.predictors[,sapply(allreg.transform.predictors, is.double)]
#allreg.transform.double = lapply(allreg.transform.double, function(x) 1/x)
#allreg.transform.double = exp(allreg.transform.double + 0.01)
allreg.transform.double = log(allreg.transform.double + 0.01)
allreg.transform.predictors = cbind(allreg.transform.factor, allreg.transform.double)
allreg.transform.predictors$MonthlyIncome = allregMonthlyIncome
# check function
res = transformDoubles(allreg.transform.predictors, function(x) { log(x + 0.01) })
NaNs producedNaNs producedNaNs producedNaNs producedNaNs producedNaNs producedNaNs produced
res$MonthlyIncome = ibm$MonthlyIncome^.7 # retain original
#dbls = sapply(allreg.transform.predictors, is.double)
#res[,names(dbls[dbls])] == allreg.transform.predictors[,names(dbls[dbls])]
#res[,names(allreg.transform.predictors)] == allreg.transform.predictors[,names(allreg.transform.predictors)]
model.allreg.transform.predictors = lm(MonthlyIncome ~ JobLevel + JobRole, data = allreg.transform.predictors)
plot(model.allreg.transform.predictors , which = 1)
boxcox(model.allreg.transform.predictors, lambda = seq(0.3, 1.2, 0.1))
simpleMonthlyIncome = ibm$MonthlyIncome^.5
simple.transform.predictors = ibm[-c(17)]
## We tried 3 types of transformation to all of non factor predictors (ln, )
simple.transform.factor = simple.transform.predictors[,sapply(simple.transform.predictors, is.factor)]
simple.transform.double = simple.transform.predictors[,sapply(simple.transform.predictors, is.double)]
#simple.transform.double = lapply(simple.transform.double, function(x) 1/x)
#simple.transform.double = exp(simple.transform.double + 0.01)
simple.transform.double = log(simple.transform.double + 0.01)
simple.transform.predictors = cbind(simple.transform.factor, simple.transform.double)
simple.transform.predictors$MonthlyIncome = simpleMonthlyIncome
model.simple.transform.predictors = lm(MonthlyIncome ~ JobLevel, data = simple.transform.predictors)
plot(model.simple.transform.predictors , which = 1)
boxcox(model.simple.transform.predictors, lambda = seq(0.3, 1.2, 0.1))
bestMontlyIncome = ibm$MonthlyIncome^.7
best.transform.predictors = ibm[-c(17)]
## We tried 3 types of transformation to all of non factor predictors
best.transform.factor = best.transform.predictors[,sapply(best.transform.predictors, is.factor)]
best.transform.double = best.transform.predictors[,sapply(best.transform.predictors, is.double)]
#best.transform.double = lapply(best.transform.double, function(x) 1/x)
#best.transform.double = exp(best.transform.double + 0.01)
best.transform.double = log(best.transform.double + 0.01)
best.transform.predictors = cbind(best.transform.factor, best.transform.double)
best.transform.predictors$MonthlyIncome = bestMontlyIncome
model.best.transform.predictors = lm(MonthlyIncome ~ Gender + JobInvolvement + JobLevel +
JobRole + NumCompaniesWorked + StockOptionLevel + TotalWorkingYears +
YearsInCurrentRole, data = best.transform.predictors)
plot(model.best.transform.predictors , which = 1)
boxcox(model.best.transform.predictors, lambda = seq(0.3, 1.2, 0.1))
# Transform Y since boxcox does not contain 1
best.transform.predictors$MonthlyIncome = best.transform.predictors$MonthlyIncome^.8
model.best.transform.predictors = lm(MonthlyIncome ~ Gender + JobInvolvement + JobLevel +
JobRole + NumCompaniesWorked + StockOptionLevel + TotalWorkingYears +
YearsInCurrentRole, data = best.transform.predictors)
plot(model.best.transform.predictors , which = 1)
boxcox(model.best.transform.predictors, lambda = seq(0.3, 1.2, 0.1))
summary(model.best.transform.predictors)
Call:
lm(formula = MonthlyIncome ~ Gender + JobInvolvement + JobLevel +
JobRole + NumCompaniesWorked + StockOptionLevel + TotalWorkingYears +
YearsInCurrentRole, data = best.transform.predictors)
Residuals:
Min 1Q Median 3Q Max
-43.928 -8.484 -0.294 8.124 49.795
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 119.7846 1.9883 60.246 < 2e-16 ***
GenderMale 0.6936 0.7032 0.986 0.3241
JobInvolvementMedium -1.9870 1.5947 -1.246 0.2130
JobInvolvementHigh -3.5640 1.5109 -2.359 0.0185 *
JobInvolvementVery High -2.9269 1.8135 -1.614 0.1068
JobLevel1 -23.2972 1.3379 -17.413 < 2e-16 ***
JobLevel3 37.2868 1.1579 32.203 < 2e-16 ***
JobLevel4 69.5899 1.9555 35.587 < 2e-16 ***
JobLevel5 88.3766 2.5359 34.850 < 2e-16 ***
JobRoleResearch Scientist -15.2874 1.4899 -10.260 < 2e-16 ***
JobRoleLaboratory Technician -16.2473 1.4884 -10.916 < 2e-16 ***
JobRoleManufacturing Director -1.0880 1.3133 -0.828 0.4076
JobRoleHealthcare Representative 0.2405 1.3669 0.176 0.8604
JobRoleManager 29.5165 2.1727 13.585 < 2e-16 ***
JobRoleSales Representative -18.3760 1.9989 -9.193 < 2e-16 ***
JobRoleResearch Director 30.4136 2.0584 14.776 < 2e-16 ***
JobRoleHuman Resources -11.5227 2.1210 -5.433 6.5e-08 ***
NumCompaniesWorked 0.1800 0.1782 1.010 0.3127
StockOptionLevel1 0.9284 0.7553 1.229 0.2192
StockOptionLevel3 -1.2227 1.5194 -0.805 0.4211
StockOptionLevel2 -1.0964 1.1779 -0.931 0.3521
TotalWorkingYears 4.2845 0.4933 8.685 < 2e-16 ***
YearsInCurrentRole 0.3352 0.1681 1.994 0.0464 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.09 on 1447 degrees of freedom
Multiple R-squared: 0.9342, Adjusted R-squared: 0.9332
F-statistic: 933.7 on 22 and 1447 DF, p-value: < 2.2e-16
##residuals
res = model.best.transform.predictors$residuals
##studentized residuals
student.res = rstandard(model.best.transform.predictors)
##externally studentized residuals
ext.student.res = rstudent(model.best.transform.predictors)
par(mfrow = c(1,3))
plot(model.best.transform.predictors$fitted.values,res,main="Residuals")
plot(model.best.transform.predictors$fitted.values,student.res,main="Studentized Residuals")
plot(model.best.transform.predictors$fitted.values,ext.student.res,main="Externally Studentized Residuals")
n = length(best.transform.predictors$MonthlyIncome)
p = length(coef(model.best.transform.predictors))
##critical value using Bonferroni procedure
ext.student.res[abs(ext.student.res)>qt(1-0.05/(2*n), n-p-1)]
named numeric(0)
# Leverage point
lev<-lm.influence(model.best.transform.predictors)$hat
length(lev[lev>2*p/n])
[1] 59
ibm[which(lev>2*p/n),]
plot(ibm$PerformanceRating)
DFFITS<-dffits(model.best.transform.predictors)
DFFITS[abs(DFFITS)>2*sqrt(p/n)]
34 51 62 86 89 100 123 140 153 154 189 213
-0.5164222 -0.3544993 -0.2753969 -0.3424822 -0.2668946 -0.3529784 0.3270479 0.2518419 -0.4118730 -0.3248535 0.3526946 0.2982517
254 263 312 336 344 391 425 441 458 472 487 511
0.2615659 -0.3398483 -0.3393454 0.2530077 0.2655421 -0.2555796 -0.2618717 0.3540281 0.2544505 0.4428963 0.3513636 0.3346399
512 523 530 552 557 565 630 758 783 787 790 871
0.2767958 0.2748183 0.2541577 0.2508164 -0.2907635 0.4020855 0.3006576 0.4161342 0.4504144 0.3548528 0.2621831 0.3557121
912 926 945 991 1013 1022 1036 1042 1064 1083 1087 1091
-0.3326712 -0.4204300 0.4038659 0.2901220 -0.2518345 0.2550441 -0.2617855 0.2510447 0.2700827 -0.3135081 -0.3254631 0.4748637
1107 1120 1126 1204 1212 1223 1244 1299 1316 1327 1366 1370
0.2800953 0.3410709 0.2993764 -0.2541930 0.4724197 -0.3679182 0.3478758 0.2852621 0.3070194 0.3166156 -0.2800686 0.3871367
1374 1395 1403 1443 1464 1466
-0.4022173 0.4448477 -0.3596750 0.3170717 0.4402406 -0.3441110
DFBETAS<-dfbetas(model.best.transform.predictors)
length(DFBETAS[abs(DFBETAS)>2/sqrt(n)])
[1] 1711
#length(DFBETAS)
nrow(ibm)
[1] 1470
length(model.best.transform.predictors$residuals)
[1] 1470
COOKS<-cooks.distance(model.best.transform.predictors)
COOKS[COOKS>qf(0.5,p,n-p)]
named numeric(0)
We now check for predictive performance.
length(model.best.transform.predictors$coefficients)
[1] 23
length(ibm[,1])
[1] 1
model.simple.rstudent = rstudent(model.simple)
n = nrow(ibm)
p = length(coef(model.simple))
plot(model.simple.rstudent,main="Externally Studentized Residuals", ylim=c(-4,4))
abline(h=qt(1-0.05/(2*n), n-p-1), col="red")
abline(h=-qt(1-0.05/(2*n), n-p-1), col="red")
model.simple.rstudent[abs(model.simple.rstudent)>qt(1-0.05/(2*n), n-p-1)]
named numeric(0)
lev = lm.influence(model.simple)$hat
influence = lev[lev>2*p/n]
{
plot(lev, main="Leverages", ylim=c(-0.8,0.8))
abline(h=2*p/n, col="red")
identify(lev)
}
integer(0)
ibm[which(lev>2*p/n),]
ibm.nosimplelev = ibm[-which(lev>2*p/n),]
model.simple.nolev = lm(MonthlyIncome ~ JobLevel, data = ibm.nosimplelev)
plot(model.simple.nolev)
plot(model.simple)
# Create subsets for each job level
jobLevel1 = subset(ibm, JobLevel == 1)
jobLevel2 = subset(ibm, JobLevel == 2)
jobLevel3 = subset(ibm, JobLevel == 3)
jobLevel4 = subset(ibm, JobLevel == 4)
jobLevel5 = subset(ibm, JobLevel == 5)
# using exhaustive search for job level 1
allreg.jobLevel1 <- regsubsets(MonthlyIncome ~., data=jobLevel1, nbest=1, really.big = T)
pp_allreg(allreg.jobLevel1)
Maximized: r2 with value 0.132988503672628
Formula: (response) ~ 1 + AttritionYes + JobRoleSales Representative + NumCompaniesWorked + PercentSalaryHike + PerformanceRatingOutstanding + TotalWorkingYears + TrainingTimesLastYear + YearsInCurrentRole + YearsSinceLastPromotion
Maximized: adjr2 with value 0.11834853469149
Formula: (response) ~ 1 + AttritionYes + JobRoleSales Representative + NumCompaniesWorked + PercentSalaryHike + PerformanceRatingOutstanding + TotalWorkingYears + TrainingTimesLastYear + YearsInCurrentRole + YearsSinceLastPromotion
Minimized: mse with value 180389.506969637
Formula: (response) ~ 1 + AttritionYes + JobRoleSales Representative + NumCompaniesWorked + PercentSalaryHike + PerformanceRatingOutstanding + TotalWorkingYears + TrainingTimesLastYear + YearsInCurrentRole + YearsSinceLastPromotion
Minimized: cp with value -6.67944653776021
Formula: (response) ~ 1 + AttritionYes + JobRoleSales Representative + NumCompaniesWorked + PercentSalaryHike + PerformanceRatingOutstanding + TotalWorkingYears + TrainingTimesLastYear + YearsInCurrentRole + YearsSinceLastPromotion
Minimized: bic with value -29.9370125416857
Formula: (response) ~ 1 + AttritionYes + TotalWorkingYears
# using exhaustive search for job level 2
allreg.jobLevel2 <- regsubsets(MonthlyIncome ~., data=jobLevel2, nbest=1, really.big = T)
10 linear dependencies found
Reordering variables and trying again:
pp_allreg(allreg.jobLevel2)
Maximized: r2 with value 0.150548565355962
Formula: (response) ~ 1 + BusinessTravelTravel_Rarely + DailyRate + EducationCollege + EducationFieldMedical + JobInvolvementHigh + JobRoleResearch Scientist + JobRoleLaboratory Technician + JobRoleSales Representative + TotalWorkingYears
Maximized: adjr2 with value 0.135958750638793
Formula: (response) ~ 1 + BusinessTravelTravel_Rarely + DailyRate + EducationCollege + EducationFieldMedical + JobInvolvementHigh + JobRoleResearch Scientist + JobRoleLaboratory Technician + JobRoleSales Representative + TotalWorkingYears
Minimized: mse with value 616551.587036018
Formula: (response) ~ 1 + BusinessTravelTravel_Rarely + DailyRate + EducationCollege + EducationFieldMedical + JobInvolvementHigh + JobRoleResearch Scientist + JobRoleLaboratory Technician + JobRoleSales Representative + TotalWorkingYears
Minimized: cp with value -18.5830550663767
Formula: (response) ~ 1 + EducationCollege + JobInvolvementHigh + JobRoleResearch Scientist + JobRoleLaboratory Technician + JobRoleSales Representative + TotalWorkingYears
Minimized: bic with value -47.0549705141564
Formula: (response) ~ 1 + JobRoleResearch Scientist + JobRoleLaboratory Technician + JobRoleSales Representative
# using exhaustive search for job level 3
allreg.jobLevel3 <- regsubsets(MonthlyIncome ~., data=jobLevel3, nbest=1, really.big = T)
8 linear dependencies found
Reordering variables and trying again:
pp_allreg(allreg.jobLevel3)
Maximized: r2 with value 0.695564396196394
Formula: (response) ~ 1 + BusinessTravelTravel_Frequently + EducationMaster + JobRoleLaboratory Technician + JobRoleManager + JobRoleResearch Director + NumCompaniesWorked + RelationshipSatisfactionMedium + StockOptionLevel1 + TotalWorkingYears
Maximized: adjr2 with value 0.682391701801045
Formula: (response) ~ 1 + BusinessTravelTravel_Frequently + EducationMaster + JobRoleLaboratory Technician + JobRoleManager + JobRoleResearch Director + NumCompaniesWorked + RelationshipSatisfactionMedium + StockOptionLevel1 + TotalWorkingYears
Minimized: mse with value 147583.376854269
Formula: (response) ~ 1 + BusinessTravelTravel_Frequently + EducationMaster + JobRoleLaboratory Technician + JobRoleManager + JobRoleResearch Director + NumCompaniesWorked + RelationshipSatisfactionMedium + StockOptionLevel1 + TotalWorkingYears
Minimized: cp with value -8.9321030170415
Formula: (response) ~ 1 + BusinessTravelTravel_Frequently + EducationMaster + JobRoleLaboratory Technician + JobRoleManager + JobRoleResearch Director + NumCompaniesWorked + RelationshipSatisfactionMedium + StockOptionLevel1 + TotalWorkingYears
Minimized: bic with value -209.049030592466
Formula: (response) ~ 1 + JobRoleLaboratory Technician + JobRoleManager + JobRoleResearch Director + StockOptionLevel1 + TotalWorkingYears
# using exhaustive search for job level 4
allreg.jobLevel4 <- regsubsets(MonthlyIncome ~., data=jobLevel4, nbest=1, really.big = T)
10 linear dependencies found
Reordering variables and trying again:
pp_allreg(allreg.jobLevel4)
Maximized: r2 with value 0.817682594713165
Formula: (response) ~ 1 + EducationDoctor + EnvironmentSatisfactionVery High + JobRoleManager + JobRoleResearch Director + JobSatisfactionVery High + MaritalStatusDivorced + MonthlyRate + PerformanceRatingExcellent + RelationshipSatisfactionVery High
Maximized: adjr2 with value 0.800590337967524
Formula: (response) ~ 1 + EducationDoctor + EnvironmentSatisfactionVery High + JobRoleManager + JobRoleResearch Director + JobSatisfactionVery High + MaritalStatusDivorced + MonthlyRate + PerformanceRatingExcellent + RelationshipSatisfactionVery High
Minimized: mse with value 43252.4357777102
Formula: (response) ~ 1 + EducationDoctor + EnvironmentSatisfactionVery High + JobRoleManager + JobRoleResearch Director + JobSatisfactionVery High + MaritalStatusDivorced + MonthlyRate + PerformanceRatingExcellent + RelationshipSatisfactionVery High
Minimized: cp with value -22.2039223919421
Formula: (response) ~ 1 + EducationDoctor + EnvironmentSatisfactionVery High + JobRoleManager + JobRoleResearch Director + JobSatisfactionVery High + MaritalStatusDivorced + MonthlyRate + PerformanceRatingExcellent + RelationshipSatisfactionVery High
Minimized: bic with value -135.961206585552
Formula: (response) ~ 1 + EnvironmentSatisfactionVery High + JobRoleManager + JobRoleResearch Director + JobSatisfactionVery High + MaritalStatusDivorced + MonthlyRate + RelationshipSatisfactionVery High
# using exhaustive search for job level 5
allreg.jobLevel5 <- regsubsets(MonthlyIncome ~., data=jobLevel5, nbest=1, really.big = T)
13 linear dependencies found
Reordering variables and trying again:
pp_allreg(allreg.jobLevel5)
Maximized: r2 with value 0.368029993171166
Formula: (response) ~ 1 + BusinessTravelTravel_Frequently + DailyRate + EducationCollege + EducationFieldMarketing + EducationFieldTechnical Degree + EducationFieldHuman Resources + PercentSalaryHike + RelationshipSatisfactionHigh + StockOptionLevel1
Maximized: adjr2 with value 0.271627788739649
Formula: (response) ~ 1 + BusinessTravelTravel_Frequently + DailyRate + EducationCollege + EducationFieldMarketing + EducationFieldTechnical Degree + EducationFieldHuman Resources + PercentSalaryHike + RelationshipSatisfactionHigh + StockOptionLevel1
Minimized: mse with value 7727.55606339568
Formula: (response) ~ 1 + BusinessTravelTravel_Frequently + DailyRate + EducationCollege + EducationFieldMarketing + EducationFieldTechnical Degree + EducationFieldHuman Resources + PercentSalaryHike + RelationshipSatisfactionHigh + StockOptionLevel1
Minimized: cp with value -51.6575926901194
Formula: (response) ~ 1 + BusinessTravelTravel_Frequently
Minimized: bic with value 2.52520106593577
Formula: (response) ~ 1 + BusinessTravelTravel_Frequently + PercentSalaryHike